The data contains information on 11 chemical properties of 6500 different bottles of vinho verde wine from northern Portugal. In this question, we try to use two unsupervised techniques (clustering and PCA) to distinguish wine color and wine quality. Write a summary of how well each technique performs and what results we find.
Our Steps
Clustering Method
use 2 clusters for color
use 7 clusters for quality
results
PCA Method
apply PCA on the data set
repeat clustering on reduced dimensions
results
Clustering
Color
We start with the clustering method. Because we want to test whether clustering can separate reds from whites, and separate different quality levels, without being told the answer, we remove the color and quality columns and rescale the remaining variables before running the unsupervised analysis.
Here we use 2 centers to measure how well clustering distinguishes whites from reds (ideally one cluster of reds and one cluster of whites).
##                          wine1$color
## cluster_2_centers$cluster  red white
##                         1   24  4830
##                         2 1575    68
The plots show that the points assigned to each cluster overlap well with the actual colors. The confusion matrix backs this up: the accuracy is (4830 + 1575) / 6497 = 0.9858396. We can therefore say that clustering with 2 centers on the chemical properties is capable of distinguishing the reds from the whites.
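The workflow described above (drop the labels, rescale, cluster with 2 centers, then cross-tabulate against the held-out color) can be sketched as follows. This is a minimal illustration on synthetic data, not the report's actual code: the data, the variable names and the `nstart` value are all assumptions.

```r
# Synthetic stand-in for the wine data (assumption): two separated groups, 3 features.
set.seed(1)
n <- 200
color <- rep(c("red", "white"), each = n)
x <- rbind(
  matrix(rnorm(n * 3, mean = 0), ncol = 3),
  matrix(rnorm(n * 3, mean = 4), ncol = 3)
)

# Rescale the features, then cluster without using the labels.
x_scaled <- scale(x)
fit <- kmeans(x_scaled, centers = 2, nstart = 25)

# Cross-tabulate cluster membership against the held-out labels.
confusion <- table(cluster = fit$cluster, color = color)
print(confusion)

# Cluster ids are arbitrary, so score each cluster by its majority label.
accuracy <- sum(apply(confusion, 1, max)) / sum(confusion)
print(accuracy)
```

Scoring each cluster by its majority label avoids depending on which cluster happens to be numbered 1.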
Quality
Next, for quality, we first intended to use 10 centers, corresponding to quality scores 1-10. However, since no wine in the data has a quality of 1, 2 or 10, it makes more sense to try 7 clusters, corresponding to quality scores 3-9.
##                          wine$quality
## cluster_7_centers$cluster   3   4   5   6   7   8   9
##                         1   7  63 471 350  43   2   0
##                         2   2  27 269 475 189  31   0
##                         3   5  64 446 549 137  27   1
##                         4   4  21  77 548 446  97   4
##                         5   1   2  20   9   1   0   0
##                         6   4  15 200 265 141  14   0
##                         7   7  24 655 640 122  22   0
The table shows that every cluster contains a mix of quality levels. Even where a cluster is dominated by one quality level, it still cannot separate the 7 levels accurately. So the clustering method is not capable of sorting the higher-quality from the lower-quality wines.
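One way to put a number on this is cluster purity: the share of wines that carry the majority quality of their own cluster. The sketch below re-enters the 7-cluster table from above by hand and compares purity with the share of the single most common quality level. Purity is a measure we are adding for illustration; it is not computed in the original analysis.

```r
# Confusion table copied from the 7-cluster output above (rows = clusters, columns = quality 3..9).
confusion <- matrix(
  c(7, 63, 471, 350,  43,  2, 0,
    2, 27, 269, 475, 189, 31, 0,
    5, 64, 446, 549, 137, 27, 1,
    4, 21,  77, 548, 446, 97, 4,
    1,  2,  20,   9,   1,  0, 0,
    4, 15, 200, 265, 141, 14, 0,
    7, 24, 655, 640, 122, 22, 0),
  nrow = 7, byrow = TRUE,
  dimnames = list(cluster = as.character(1:7), quality = as.character(3:9))
)

# Purity: assign every cluster its majority quality and count the matches.
purity <- sum(apply(confusion, 1, max)) / sum(confusion)

# Baseline: always guess the most common quality level.
baseline <- max(colSums(confusion)) / sum(confusion)

print(round(c(purity = purity, baseline = baseline), 4))
```

On this table purity comes out at roughly 0.46 against a baseline of roughly 0.44, which is consistent with the conclusion that the clusters carry little information about quality.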
PCA
We wish to visualize 6500 observations with measurements on a set of 11 features: fixed.acidity, volatile.acidity, citric.acid, etc. We could do this by examining two-dimensional scatterplots of the data, each of which contains the 6500 observations’ measurements on two of the 11 features. However, there are \(\binom{11}{2} = \frac{11 \times 10}{2} = 55\) such scatterplots (pick 2 among 11 features). PCA provides a tool to find a low-dimensional representation of a data set that retains as much of the variation, and therefore as much of the information, as possible.
Specifically, our goal here is to use PCA to reduce noise and find the few most important components that can help us clearly distinguish between reds and whites, as well as between different quality levels of wine.
First, we look for the principal components that explain the highest proportion of variance among the 11 features, and we plot the components from highest to lowest variance. We find that PC1 and PC2 are the two most important components, explaining proportions of variance of 0.2754 and 0.2267 respectively.
We again choose K=2 (the same as in the clustering method above). From the following graph, we can see that PCA does well in distinguishing reds from whites. The plot shows two clear clusters separating reds from whites with very little overlap.
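The proportion of variance per component can be read off a `prcomp` fit. The following is a minimal sketch on synthetic data; the feature names and the data-generating process are assumptions, so the printed proportions will not match the wine figures.

```r
# Synthetic features (assumption): a and b share a latent factor, c and d are noise.
set.seed(2)
n <- 300
latent <- rnorm(n)
features <- cbind(
  a = latent + rnorm(n, sd = 0.3),
  b = latent + rnorm(n, sd = 0.3),
  c = rnorm(n),
  d = rnorm(n)
)

# Center and scale so every feature contributes on the same footing.
pca <- prcomp(features, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component, in decreasing order.
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
print(round(prop_var, 4))

# Scores on the first two components, i.e. the 2-D representation used for plotting.
scores <- pca$x[, 1:2]
```

The same proportions appear in the "Proportion of Variance" row of `summary(pca)`.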
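The accuracy computed for clustering on the reduced dimensions is 0.0163152. Because K-means cluster numbers are arbitrary, an accuracy this close to 0 most likely means that the two cluster labels are swapped relative to the colors; with the labels matched, the agreement would be 1 - 0.0163152 = 0.9836848. Compared with 0.9858396, reducing dimensions before clustering is therefore at best marginally worse than simple K-means clustering at distinguishing whites from reds.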
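Cluster numbers from K-means are arbitrary, so an accuracy computed against a fixed labeling can come out near 0 simply because the two labels are swapped. The sketch below, on synthetic data with assumed names, clusters on the first two principal components and then scores the result under both labelings.

```r
# Synthetic two-group data (assumption), 4 features.
set.seed(3)
n <- 200
truth <- rep(c("red", "white"), each = n)
x <- rbind(
  matrix(rnorm(n * 4, mean = 0), ncol = 4),
  matrix(rnorm(n * 4, mean = 3), ncol = 4)
)

# Reduce to the first two principal components, then cluster on the scores.
scores <- prcomp(x, center = TRUE, scale. = TRUE)$x[, 1:2]
fit <- kmeans(scores, centers = 2, nstart = 25)

# Naive accuracy: pretend cluster 1 means "red" and cluster 2 means "white".
naive <- mean(c("red", "white")[fit$cluster] == truth)

# With two clusters, the swapped labeling scores exactly 1 - naive.
matched <- max(naive, 1 - naive)
print(c(naive = naive, matched = matched))
```

Depending on the random start, `naive` may be close to 0 or close to 1, while `matched` does not depend on the numbering.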
Moving on to wine quality, we first conduct PCA on the rescaled wine data.
Similar to simple K-means clustering, PCA does not perform well in distinguishing wines of different quality levels. The graph is blurry: wines of different quality levels are concentrated in the same region of the component 1/component 2 plane.
Next, we apply PCA before running K-means with 7 clusters. However, as the graph below shows, this does not distinguish between quality levels much better than PCA alone. Again, we can look at the confusion matrix of the PCA K-means clusters against wine quality.
##                  wine$quality
## clustPCA2$cluster   3   4   5   6   7   8   9
##                 1   4   5  91 123  55   6   0
##                 2   5  39 439 592 143  23   0
##                 3   5  55 311 532 198  43   0
##                 4   7  23 551 539 102  18   1
##                 5   2  17 113 530 447  94   4
##                 6   0  23 305 255  97   7   0
##                 7   7  54 328 265  37   2   0
As we can see, the clusters misidentify many observations, making this approach ill-suited to predicting wine quality.
Conclusion
The clustering method is capable of distinguishing between red wine and white wine, even without applying PCA to reduce noise in the 11 chemical properties.
However, the clustering method is incapable of distinguishing among the different wine quality levels, since the quality distribution is centered heavily around 5 and 6, making it hard for 7 clusters to differentiate the quality. We found the optimal k to be 3, which further supports this argument.
Goal
We want to use the clustering method and PCA to identify some interesting market segments and provide insights that help NutrientH2O understand its social-media audience better, so that it can hone its messaging more sharply to target followers.
Pre-process the Data
We first clean the data a bit. Some columns should be omitted since they are likely to distort the model we build. We choose to ignore these categories: Chatter, Spam, Adult and Uncategorized, since we consider them uninformative or inappropriate.
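Dropping these categories might look like the sketch below. The toy data frame and the lowercase column names are assumptions, since the actual column names of the data set are not shown in this report.

```r
# Toy stand-in for the tweet-count data (assumption): one row per follower.
tweets <- data.frame(
  chatter       = c(2, 5, 1),
  spam          = c(0, 0, 1),
  adult         = c(0, 1, 0),
  uncategorized = c(1, 0, 2),
  cooking       = c(3, 0, 4),
  travel        = c(1, 4, 0)
)

# Remove the categories we decided to ignore.
drop_cols <- c("chatter", "spam", "adult", "uncategorized")
tweets_clean <- tweets[, setdiff(names(tweets), drop_cols), drop = FALSE]

# Share of all remaining tweets that falls in each topic.
topic_share <- colSums(tweets_clean) / sum(tweets_clean)
print(round(topic_share, 3))
```

Using `setdiff` on the column names keeps the step safe even if one of the listed columns is absent.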
Overview the Data
We can first get a rough sense of the whole data set. Photo Sharing and Health Nutrition are the most popular topics for tweets. Apart from allocating marketing resources to these two categories, we should not ignore other topics with potential to be explored further. As we can see from the graph, followers tweet about a wide range of other topics, from cooking (approximately 5.5%) to art (approximately 2%). We should try to broaden the marketing to more of these topics. Therefore, we use the clustering method to group similar followers and their tweets and to identify interesting market segments for product promotion.
Clustering Method
We would like to find an optimal K for the clustering. Using the CH index, we see that K=2 gives the maximum CH. However, we feel that 2 clusters are not enough to analyze the tweet features. We instead use K=6, which appears to be an elbow in the graph, to explore deeper.
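For reference, the CH (Calinski-Harabasz) index for a given K is the between-cluster sum of squares per degree of freedom divided by the within-cluster sum of squares per degree of freedom, CH(K) = (B / (K - 1)) / (W / (n - K)). A minimal sketch of scanning K with base R's `kmeans` follows; the three-blob data and the helper name `ch_index` are assumptions for illustration.

```r
# Synthetic data (assumption): three well-separated 2-D blobs.
set.seed(4)
blob <- function(cx, cy) cbind(rnorm(100, cx), rnorm(100, cy))
x <- scale(rbind(blob(0, 0), blob(6, 0), blob(0, 6)))

# CH index for one value of k, computed from the kmeans sums of squares.
ch_index <- function(x, k) {
  fit <- kmeans(x, centers = k, nstart = 25, iter.max = 50)
  n <- nrow(x)
  (fit$betweenss / (k - 1)) / (fit$tot.withinss / (n - k))
}

# Scan a range of k and pick the maximizer.
ks <- 2:8
ch <- sapply(ks, function(k) ch_index(x, k))
best_k <- ks[which.max(ch)]
print(data.frame(k = ks, ch = round(ch, 1)))
print(best_k)
```

Plotting `ch` against `ks` gives the kind of curve on which a maximum or an elbow can be read off.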
Let’s try K=6.
NOTE: Since K-means clustering starts from random centers, each run can produce different clusters, so the analysis below may not correspond exactly to the graph from another run. Here, we summarize our insights and give suggestions based on the graph from one typical run.
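The run-to-run variation mentioned in the note can be controlled by fixing the random seed and using several random starts. The sketch below shows one way to do this; the data, the seed value and the helper name `run_kmeans` are assumptions.

```r
# Synthetic data (assumption): 500 followers, 4 scaled features.
set.seed(5)
x <- scale(matrix(rnorm(500 * 4), ncol = 4))

# Fix the seed right before clustering so the random starts are repeatable.
run_kmeans <- function(seed) {
  set.seed(seed)
  kmeans(x, centers = 6, nstart = 25, iter.max = 50)
}

fit_a <- run_kmeans(42)
fit_b <- run_kmeans(42)

# Same seed, same data: the two runs give identical assignments.
print(identical(fit_a$cluster, fit_b$cluster))
```

With a fixed seed, the cluster numbering stays stable, so a written description of "Cluster 1" keeps referring to the same group across re-runs.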
Cluster 1 Young Adults Before Marriage
Insights
Cluster 1 is related to dating, current events, art, shopping and television/film. It is somewhat hard to tell which group has these features. We assume it to be Young Adults Before Marriage: they are interested in dating since they are not yet married, and they are keen on current events, art, shopping and TV/film.
Suggestions
We would suggest that NutrientH2O do something related to social relationships. For example, the app Bumble offers a platform that lets young people make friends or date online with people who share their interests.
Cluster 2 Working Professionals
Insights
Cluster 2 is related to automotive, computers, travel, news and politics. We presume this cluster is mainly Working Professionals. They tend to travel a lot for work (for example, in a Big 4 auditing department), are fond of high tech, and pay close attention to daily news and politics.
Suggestions
We would suggest that NutrientH2O pay more attention to sharing up-to-date news related to cars, computers, travel and politics. News on these topics is likely to catch the eye of working professionals.
Cluster 3 Fashion Chasers
Insights
Cluster 3 is related to music, photo_sharing, beauty, fashion and cooking. This cluster can be classified as Fashion Chasers. They care a lot about fashion trends. We expect that they use social media heavily to post photos related to music, beauty, cooking, etc.
Suggestions
We would suggest that NutrientH2O develop its products and services further in the social media area. For example, NutrientH2O could hold online competitions around retweets and posts, rewarding the people whose Instagram posts are shared the most.
Cluster 4 Typical Moms
Insights
Cluster 4 is related to school, food, sports fandom, parenting and religion. This cluster appears to represent the typical mom, tweeting about school, food, sports, parenting and religion. We picture them spending a lot of time volunteering at their kids’ schools, getting involved in their kids’ sports activities, and going to church every weekend.
Suggestions
We would suggest that NutrientH2O devote time and energy to these moms, who account for a large part of middle-aged parents and have a lot of spending power. NutrientH2O should focus significant energy and marketing dollars on this group, with offerings related not only to their kids but also tailored to the moms themselves. For instance, NutrientH2O could develop a product that provides leisure entertainment for moms while they wait to pick up their kids from school.
Cluster 5 Young Students
Insights
Cluster 5 is related to music, television/film, sports, online games and college/university. This cluster can be regarded as young students who attend school and love to listen to music, watch films, play sports and play online games.
Suggestions
We would suggest that NutrientH2O pay more attention to content around popular music (not classical, since we suppose young people prefer K-pop, rap, etc.), films that follow current trends, fast-paced and fierce sports (rather than the slow-paced or relaxing sports that we suppose appeal to older people), and online games with strong visual effects, since we suppose young people love intense competition.
Cluster 6 Fit Generation
Insights
Cluster 6 is related to food, eco, outdoors, personal fitness and health nutrition. We speculate that this cluster is a Fit Generation. These people focus on a balanced lifestyle by eating healthy and nutritious food, participating in outdoor activities, doing athletic sports, etc.
Suggestions
We would suggest that NutrientH2O advertise more eco-friendly, healthy and nutritious products that appeal to these people.
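One plausible way to arrive at topic lists like those in the cluster descriptions above is to rank each cluster center's coordinates and keep the largest ones. The report does not state how its lists were produced, so the following is only a sketch under that assumption, on random toy data with made-up topic columns.

```r
# Toy tweet-share data (assumption): 300 followers, 6 topics.
set.seed(6)
topics <- c("dating", "art", "travel", "politics", "cooking", "fashion")
x <- matrix(rexp(300 * 6), ncol = 6, dimnames = list(NULL, topics))

# Scale the topics, then cluster.
x_scaled <- scale(x)
fit <- kmeans(x_scaled, centers = 3, nstart = 25)

# For every cluster, list the two topics with the highest center value.
top_topics <- apply(fit$centers, 1, function(center) {
  names(sort(center, decreasing = TRUE))[1:2]
})
print(top_topics)  # one column per cluster
```

Because the centers are on the scaled features, a large value means the cluster tweets about that topic more than the average follower does.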
PCA Method
We pick the first two principal components for further analysis. These six plots show the six groups we identified with the clustering method, projected onto the two-dimensional PCA result. Each point represents a Twitter user, and the redder it is, the higher the percentage of that user’s posts that relate to the corresponding group.
In other words, the redder a graph is, the more users we can suppose belong to that hypothetical group. Here, there are relatively more red dots in the graphs of “Fit Generation”, “Fashion Chasers” and “Young Students”. Hence, we can say that these groups are clustered relatively more precisely.
Conclusion
The main insights and suggestions are discussed above for each cluster and for PCA. Here we add a few concluding points. Clustering and PCA work well for market segmentation. This output can help NutrientH2O better target its audience and focus its social media marketing efforts on more clearly defined groups of people.
For the given data set, we first examine the distribution of items bought. From the graph below, the most popular item is whole milk, followed by other vegetables, rolls and buns, soda, and then yogurt.
We start by looping over support values from 0.009 to 0.05 and confidence values from 0.2 to 0.5. Among these combinations, we look for the one that gives the maximum average lift, since a high lift means a strong association between the items in the basket. Our goal is to get a high lift value with the maximum support.
##      sup  con avg_inspection
## 27 0.019 0.45       2.014596
## 32 0.019 0.50       2.007235
## 22 0.019 0.40       1.863106
## 28 0.029 0.45       1.850203
## 23 0.029 0.40       1.793654
## 24 0.039 0.40       1.787991
## 17 0.019 0.35       1.767332
## 19 0.039 0.35       1.733120
## 12 0.019 0.30       1.730407
## 7  0.019 0.25       1.711049
The best result is for support = 0.009 and confidence = 0.5, with a maximum average lift of 2.2255. Nevertheless, increasing the support ensures that the rules are backed by more transactions containing the items of interest. The trade-off is a decrease in lift, which is what we see here. But a slightly higher support gives rules backed by many more transactions with minimal effect on lift. Thus, we choose a support of 0.01 and a confidence of 0.4.
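The grid search over support and confidence thresholds can be illustrated without any package, as sketched below. This is not the `apriori` run used in the report: it only enumerates single-item rules {a} => {b} on six made-up baskets, and the thresholds and names are assumptions chosen to fit the toy data.

```r
# Toy baskets (assumption), standing in for the grocery transactions.
baskets <- list(
  c("whole milk", "yogurt"),
  c("whole milk", "other vegetables"),
  c("other vegetables", "onions"),
  c("whole milk", "other vegetables", "onions"),
  c("soda"),
  c("whole milk", "yogurt", "other vegetables")
)
items <- sort(unique(unlist(baskets)))

# Support of an item set: share of baskets containing every item in the set.
support <- function(set) {
  mean(vapply(baskets, function(b) all(set %in% b), logical(1)))
}

# Enumerate all single-item rules {lhs} => {rhs} with their metrics.
rules <- expand.grid(lhs = items, rhs = items, stringsAsFactors = FALSE)
rules <- rules[rules$lhs != rules$rhs, ]
rules$support <- mapply(function(l, r) support(c(l, r)), rules$lhs, rules$rhs,
                        USE.NAMES = FALSE)
rules$confidence <- rules$support / vapply(rules$lhs, support, numeric(1), USE.NAMES = FALSE)
rules$lift <- rules$confidence / vapply(rules$rhs, support, numeric(1), USE.NAMES = FALSE)

# Grid of minimum support and confidence; record the average lift of surviving rules.
grid <- expand.grid(sup = c(0.2, 0.3), con = c(0.5, 0.75))
grid$avg_lift <- mapply(function(min_sup, min_con) {
  kept <- rules[rules$support >= min_sup & rules$confidence >= min_con, ]
  if (nrow(kept) == 0) NA_real_ else mean(kept$lift)
}, grid$sup, grid$con)

# Sort so the threshold pair with the highest average lift comes first.
grid <- grid[order(-grid$avg_lift), ]
print(grid)
```

Each grid row plays the role of one line in the `sup` / `con` table above, with `avg_lift` corresponding to its last column.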
##      lhs                                  rhs                support    confidence lift     count
## [1]  {onions}                          => {other vegetables} 0.01423488 0.4590164  2.372268 140
## [2]  {hamburger meat}                  => {other vegetables} 0.01382816 0.4159021  2.149447 136
## [3]  {chicken}                         => {other vegetables} 0.01789527 0.4170616  2.155439 176
## [4]  {whipped/sour cream}              => {other vegetables} 0.02887646 0.4028369  2.081924 284
## [5]  {root vegetables}                 => {other vegetables} 0.04738180 0.4347015  2.246605 466
## [6]  {curd, yogurt}                    => {whole milk}       0.01006609 0.5823529  2.279125  99
## [7]  {pork, whole milk}                => {other vegetables} 0.01016777 0.4587156  2.370714 100
## [8]  {butter, other vegetables}        => {whole milk}       0.01148958 0.5736041  2.244885 113
## [9]  {butter, whole milk}              => {other vegetables} 0.01148958 0.4169742  2.154987 113
## [10] {domestic eggs, other vegetables} => {whole milk}       0.01230300 0.5525114  2.162336 121
## [11] {domestic eggs, whole milk}       => {other vegetables} 0.01230300 0.4101695  2.119820 121
## [12] {whipped/sour cream, yogurt}      => {other vegetables} 0.01016777 0.4901961  2.533410 100
## [13] {whipped/sour cream, yogurt}      => {whole milk}       0.01087951 0.5245098  2.052747 107
## [14] {whipped/sour cream, whole milk}  => {other vegetables} 0.01464159 0.4542587  2.347679 144
## [15] {other vegetables, pip fruit}     => {whole milk}       0.01352313 0.5175097  2.025351 133
## [16] {pip fruit, whole milk}           => {other vegetables} 0.01352313 0.4493243  2.322178 133
## [17] {citrus fruit, root vegetables}   => {other vegetables} 0.01037112 0.5862069  3.029608 102
## [18] {citrus fruit, whole milk}        => {other vegetables} 0.01301474 0.4266667  2.205080 128
## [19] {root vegetables, tropical fruit} => {other vegetables} 0.01230300 0.5845411  3.020999 121
## [20] {root vegetables, tropical fruit} => {whole milk}       0.01199797 0.5700483  2.230969 118
## [21] {tropical fruit, yogurt}          => {other vegetables} 0.01230300 0.4201389  2.171343 121
## [22] {tropical fruit, yogurt}          => {whole milk}       0.01514997 0.5173611  2.024770 149
## [23] {tropical fruit, whole milk}      => {other vegetables} 0.01708185 0.4038462  2.087140 168
## [24] {root vegetables, yogurt}         => {other vegetables} 0.01291307 0.5000000  2.584078 127
## [25] {root vegetables, yogurt}         => {whole milk}       0.01453991 0.5629921  2.203354 143
## [26] {rolls/buns, root vegetables}     => {other vegetables} 0.01220132 0.5020921  2.594890 120
## [27] {rolls/buns, root vegetables}     => {whole milk}       0.01270971 0.5230126  2.046888 125
## [28] {root vegetables, whole milk}     => {other vegetables} 0.02318251 0.4740125  2.449770 228
## [29] {other vegetables, yogurt}        => {whole milk}       0.02226741 0.5128806  2.007235 219
After picking values for support and confidence, we rerun apriori. We keep only the rules whose lift is larger than 2; because the mean lift is very close to 2, this eliminates the weakly associated rules. This gives us a set of 29 strongly associated rules. In these rules, the consequent is always either other vegetables (19 rules) or whole milk (10 rules). Shoppers with a wide variety of baskets are therefore likely to also buy whole milk and/or other vegetables.
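To make the three rule measures concrete: support is the share of transactions containing all items of the rule, confidence is support(lhs and rhs) / support(lhs), and lift is confidence / support(rhs). The sketch below computes them by hand on five made-up baskets (an assumption, not the grocery data) and then back-checks rule [1] of the table above using only numbers printed there.

```r
# Toy baskets (assumption) to illustrate the definitions.
baskets <- list(
  c("whole milk", "yogurt"),
  c("whole milk", "other vegetables"),
  c("other vegetables", "onions"),
  c("whole milk", "other vegetables", "onions"),
  c("soda")
)

# Share of baskets that contain every item in the given set.
support <- function(items) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Support, confidence and lift of the rule lhs => rhs.
rule_metrics <- function(lhs, rhs) {
  s <- support(c(lhs, rhs))
  conf <- s / support(lhs)
  c(support = s, confidence = conf, lift = conf / support(rhs))
}

m <- rule_metrics("onions", "other vegetables")
print(round(m, 4))

# Back-check on rule [1] {onions} => {other vegetables} from the table above:
# lift = confidence / support(rhs), so support(rhs) = confidence / lift.
implied_rhs_support <- 0.4590164 / 2.372268
# support = count / number of transactions, so transactions = count / support.
implied_n_transactions <- 140 / 0.01423488
print(c(implied_rhs_support, implied_n_transactions))
```

A lift above 1 means the consequent shows up more often in baskets containing the antecedent than in baskets overall, which is why filtering on lift larger than 2 keeps only fairly strong associations.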
The graph depicts the importance of the various basket items. Whole milk and other vegetables, the most common items, sit in the middle, with branches extending outwards to other items.
The next plot is a two-key plot, drawn not only for the subset but for the whole set of rules, as a function of support and confidence.
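library(arulesViz)  # provides plot() methods for arules rule sets
# Draw 20 of the 29 rules at random for the graph plot further below
subrules <- sample(subset_groc, 20)
plot(subset_groc, method="matrix", measure="lift", control=list(reorder='support/confidence'))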
## Itemsets in Antecedent (LHS)
## [1] "{root vegetables}" "{whipped/sour cream}"
## [3] "{root vegetables,whole milk}" "{other vegetables,yogurt}"
## [5] "{chicken}" "{tropical fruit,whole milk}"
## [7] "{whipped/sour cream,whole milk}" "{onions}"
## [9] "{hamburger meat}" "{tropical fruit,yogurt}"
## [11] "{root vegetables,yogurt}" "{other vegetables,pip fruit}"
## [13] "{pip fruit,whole milk}" "{citrus fruit,whole milk}"
## [15] "{rolls/buns,root vegetables}" "{domestic eggs,other vegetables}"
## [17] "{domestic eggs,whole milk}" "{root vegetables,tropical fruit}"
## [19] "{butter,other vegetables}" "{butter,whole milk}"
## [21] "{whipped/sour cream,yogurt}" "{citrus fruit,root vegetables}"
## [23] "{pork,whole milk}" "{curd,yogurt}"
## Itemsets in Consequent (RHS)
## [1] "{other vegetables}" "{whole milk}"
plot(subrules, method="graph", control=list(layout=igraph::in_circle()))
The final graph is a matrix representation of the rules, with the color scale showing the lift. We can match the matrix to the lift values above to recover the exact items in each basket.
Hence, these visualizations depict the strength of the associations.